A distributed framework for trimmed Kernel k-Means clustering

نویسندگان

  • Nikolaos Tsapanos
  • Anastasios Tefas
  • Nikos Nikolaidis
  • Ioannis Pitas
چکیده

Data clustering is an unsupervised learning task that has found many applications in various scientific fields. The goal is to find subgroups of closely related data samples (clusters) in a set of unlabeled data. Kernel k-Means is a state of the art clustering algorithm. However, in contrast to clustering algorithms that can work using only a limited percentage of the data at a time, Kernel k-Means is a global clustering algorithm. It requires the computation of the kernel matrix, which takes O(nd) time and O(n) space in memory. As datasets grow larger, the application of Kernel k-Means becomes infeasible on a single computer, a fact that strongly suggests a distributed approach. In this paper, we present such an approach to the Kernel k-Means clustering algorithm, in order to make its application to a large number of samples feasible and, thus, achieve high performance clustering results on very big datasets. Our distributed approach follows the MapReduce programming model and consists of 3 stages, the kernel matrix computation, a novel matrix trimming method and the Kernel k-Means clustering algorithm.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

یادگیری نیمه نظارتی کرنل مرکب با استفاده از تکنیک‌های یادگیری معیار فاصله

Distance metric has a key role in many machine learning and computer vision algorithms so that choosing an appropriate distance metric has a direct effect on the performance of such algorithms. Recently, distance metric learning using labeled data or other available supervisory information has become a very active research area in machine learning applications. Studies in this area have shown t...

متن کامل

Scalable Embeddings for Kernel Clustering on MapReduce

There is an increasing demand from businesses and industries to make the best use of their data. Clustering is a powerful tool for discovering natural groupings in data. The k-means algorithm is the most commonly-used data clustering method, having gained popularity for its effectiveness on various data sets and ease of implementation on different computing architectures. It assumes, however, t...

متن کامل

Distributed, MapReduce-based Nearest Neighbor and -ball Kernel k-Means

Data clustering is an unsupervised learning task that has found many applications in various scientific fields. The goal is to find subgroups of closely related data samples (clusters) in a set of unlabeled data. A classic clustering algorithm is the so-called k-Means. It is very popular, however, it is also unable to handle cases in which the clusters are not linearly separable. Kernel k-Means...

متن کامل

Combination of Transformed-means Clustering and Neural Networks for Short-Term Solar Radiation Forecasting

In order to provide an efficient conversion and utilization of solar power, solar radiation datashould be measured continuously and accurately over the long-term period. However, the measurement ofsolar radiation is not available to all countries in the world due to some technical and fiscal limitations. Hence,several studies were proposed in the literature to find mathematical and physical mod...

متن کامل

Discriminative K-means for Clustering

We present a theoretical study on the discriminative clustering framework, recently proposed for simultaneous subspace selection via linear discriminant analysis (LDA) and clustering. Empirical results have shown its favorable performance in comparison with several other popular clustering algorithms. However, the inherent relationship between subspace selection and clustering in this framework...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Pattern Recognition

دوره 48  شماره 

صفحات  -

تاریخ انتشار 2015